Abstract

The Binary Jumbled String Matching problem is defined as follows: given a string $s$ over $\{a, b\}$ of
length $n$ and a query $(x, y)$, with $x, y$ non-negative integers, decide whether $s$ has a substring $t$
with exactly $x$ $a$'s and $y$ $b$'s. Previous solutions created an index of size $O(n)$ in a pre-processing
step, which was then used to answer queries in constant time. The fastest algorithms for construction
of this index have running time $O(n^2 / \log n)$ [Burcsi et al., FUN 2010; Moosa and Rahman, IPL
2010], or $O(n^2 / \log^2 n)$ in the word-RAM model [Moosa and Rahman, JDA 2012]. We propose an
index constructed directly from the run-length encoding of $s$. The construction time of our index
is $O(n + p^2 \log p)$, where $O(n)$ is the time for computing the run-length encoding of $s$ and $p$ is the
length of this encoding; this is no worse than previous solutions if $p = O(n / \log n)$ and better if
$p = o(n / \log n)$. Our index $L$ can be queried in $O(\log p)$ time. While $|L| = O(\min(n, p^2))$ in the
worst case, preliminary investigations have indicated that $|L|$ may often be close to $p$. Furthermore,
the algorithm for constructing the index is conceptually simple and easy to implement. In an
attempt to shed light on the structure and size of our index, we characterize it in terms of the
prefix normal forms of $s$ introduced in [Fici and Lipták, DLT 2011].
1. Introduction

Binary jumbled string matching is defined as follows: given a string $s$ over $\{a, b\}$ and a query
vector $(x, y)$ of non-negative integers $x$ and $y$, decide whether $s$ has a substring containing exactly
$x$ $a$'s and $y$ $b$'s. If this is the case, we say that $(x, y)$ occurs in $s$. The Parikh set of $s$, $\Pi(s)$, is the
set of all vectors occurring in $s$.

For one query, the problem can be solved optimally by a simple sliding window algorithm in
$O(n)$ time, where $n$ is the length of the text. Here we are interested in the indexing variant where
the text is fixed, and we expect a large number of queries. Recently, this problem and its variants
have generated much interest [2, 3, 5]. The crucial observation is based on the following
property of binary strings:

Interval Lemma. Given a string $s$ over $\Sigma = \{a, b\}$, $|s| = n$. For every $m \in \{1, \ldots, n\}$:

if, for some $x < x'$, both $(x, m - x)$ and $(x', m - x')$ occur in $s$, then so does $(z, m - z)$ for all $z$,
$x < z < x'$.

It thus suffices to store, for every query length $m$, the minimum and maximum number of $a$'s in
all $m$-length substrings of $s$. This information can be stored in a linear size index, and now any
query of the form $(x, y)$ can be answered by looking up whether $x$ lies between the minimum and
maximum number of $a$'s for length $m = x + y$. The query time is proportional to the time it takes
to find $x + y$ in the index, which is constant in most implementations.
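The lookup just described can be sketched in a few lines of Python. This is a minimal illustration of the linear-size index built by brute force over all window lengths, not one of the faster construction algorithms from the literature; the function names `build_length_index` and `occurs` are assumptions made for this sketch.

```python
from itertools import accumulate


def build_length_index(s):
    """Return (f, F): f[m] and F[m] are the minimum and maximum number of
    a's over all substrings of s of length m (brute force, quadratic time)."""
    n = len(s)
    # prefix[i] = number of a's in s[:i]
    prefix = [0] + list(accumulate(1 if c == "a" else 0 for c in s))
    f = [0] * (n + 1)
    F = [0] * (n + 1)
    for m in range(1, n + 1):
        counts = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        f[m], F[m] = min(counts), max(counts)
    return f, F


def occurs(f, F, x, y):
    """Constant-time query: does some substring have exactly x a's and y b's?
    Assumes x and y are non-negative integers."""
    m = x + y
    return m < len(F) and f[m] <= x <= F[m]
```

For example, with `f, F = build_length_index("aabababbaaabbaabbb")`, the call `occurs(f, F, 3, 3)` checks whether 3 lies between `f[6]` and `F[6]`.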

This index can be constructed naively in $O(n^2)$ time. In [1] and independently in [7],
construction algorithms were presented with running time $O(n^2 / \log n)$, using reduction to min-plus
convolution. In the word-RAM model, the running time can again be reduced to $O(n^2 / \log^2 n)$,
using bit-parallelism [8]. More recently, a Monte Carlo algorithm with running time $O(n^{1+\epsilon})$ was
introduced [5], which constructs an approximate index allowing one-sided errors, with the
probability of an incorrect answer depending on the choice of $\epsilon$.

Any binary string $s$ can be uniquely written in the form $s = a^{u_1} b^{v_1} a^{u_2} b^{v_2} \cdots a^{u_r} b^{v_r}$, where the
$u_i, v_i$ are non-negative integers, all non-zero except possibly $u_1$ and $v_r$. The run-length encoding
of $s$ is then defined as $\mathit{rle}(s) = (u_1, v_1, u_2, v_2, \ldots, u_r, v_r)$. This representation is often used to
compress strings, especially in domains where long runs of characters occur frequently, such as the
representation of digital images, multimedia databases, and time series.
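As an illustration of this definition, the following sketch computes the run-length encoding of a string over {a, b} in linear time. The function name `rle` mirrors the notation above; padding with an explicit zero for a missing first a-run or last b-run is a choice made here so that the output always has the form (u_1, v_1, ..., u_r, v_r).

```python
from itertools import groupby


def rle(s):
    """Run-length encoding (u1, v1, ..., ur, vr) of a string over {a, b}.
    u1 is 0 if s starts with b, and vr is 0 if s ends with a."""
    runs = [(c, len(list(g))) for c, g in groupby(s)]
    if runs and runs[0][0] == "b":
        runs.insert(0, ("a", 0))   # empty first a-run
    if runs and runs[-1][0] == "a":
        runs.append(("b", 0))      # empty last b-run
    return [length for _, length in runs]
```

For the string aabababbaaabbaabbb this yields the a-runs 2, 1, 1, 3, 2 interleaved with the b-runs 1, 1, 2, 2, 3.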

In this paper, we present the Corner Index $L$ which, for strings with good run-length
compression, is much smaller than the linear size index used by all previous solutions. It is constructed
directly from the run-length encoding of $s$, in time $O(p^2 \log p)$, where $p = |\mathit{rle}(s)|$. The Corner
Index has worst-case size $\min(n, p^2)$ (measured in the number of entries, which fit into two computer
words). We pay for this with an increase in lookup time from $O(1)$ to $O(\log |L|) = O(\log p)$.

In a recent paper [6], the prefix normal forms of a binary string were introduced. Given $s$ of
length $n$, $\mathrm{PNF}_a(s)$ is the unique string such that, for every $1 \le m \le n$, its $m$-length prefix has
the same number of $a$'s as the maximum number of $a$'s in any $m$-length substring of $s$; $\mathrm{PNF}_b(s)$
is defined analogously. It was shown in [6] that two strings $s$ and $t$ have the same Parikh set if
and only if $\mathrm{PNF}_a(s) = \mathrm{PNF}_a(t)$ and $\mathrm{PNF}_b(s) = \mathrm{PNF}_b(t)$. From this perspective, our index can be
viewed as storing the run-length encodings of $\mathrm{PNF}_a(s)$ and $\mathrm{PNF}_b(s)$. This allows us a fresh view
on the problem, and may point to a promising way of proving bounds on the index size. Moreover,
our algorithm constitutes an improvement both for the computation and the testing problems on
prefix normal forms (see [6]) whenever $\mathit{rle}(s)$ is short.
i        0  1  2  3  4  5  6  7  8  9
bmin(i)  0  0  0  0  2  2  4  4  6  6
bmax(i)  3  3  5  5  5  7  8  9  9  9

Table 1: Functions bmin and bmax for the string s = aabababbaaabbaabbb.

The construction time of $O(n + p^2 \log p)$, where $O(n)$ is for computing $\mathit{rle}(s)$ and $O(p^2 \log p)$
for constructing the Corner Index, is much better than the previous $O(n^2 / \log n)$ time algorithms
for strings with short run-length encodings, and no worse as long as $p = O(n / \log n)$. For strings
with good run-length compression, the increase in lookup time from $O(1)$ to $O(\log |L|)$ is justified
in our view by the reduced size and construction time of the new index. Finally, our algorithm is
conceptually simple and easy to implement.

2. Preliminaries 

A binary string $s = s_1 s_2 \cdots s_n$ is a finite sequence of characters from $\{a, b\}$. We denote the
length of $s$ by $|s|$. For two strings $s, t$, we say that $t$ is a substring of $s$ if there are indices
$1 \le i, j \le |s|$ such that $t = s_i \cdots s_j$. If $i = 1$, then $t$ is called a prefix of $s$. We denote by $|s|_a$ (resp.
$|s|_b$) the number of $a$'s (resp. $b$'s) in $s$. The Parikh vector of $s$ is defined as $p(s) = (|s|_a, |s|_b)$. We
say that a Parikh vector $q$ occurs in string $s$ if $s$ has a substring $t$ such that $p(t) = q$. The Parikh
set of $s$, $\Pi(s)$, is the set of all Parikh vectors occurring in $s$.

The Interval Lemma from the Introduction implies that, for any binary string $s$, there are
functions $F$ and $f$ s.t.

$$(x, y) \text{ occurs in } s \text{ if and only if } f(x + y) \le x \le F(x + y), \tag{1}$$

namely, for $m = 0, \ldots, |s|$, $F(m) = \max\{x \mid (x, m - x) \in \Pi(s)\}$ and $f(m) = \min\{x \mid (x, m - x) \in
\Pi(s)\}$. This can be stated equivalently in terms of the minimum and maximum number of $b$'s in
all substrings containing a fixed number $i$ of $a$'s. Let us denote by $\mathit{bmin}(i)$ (resp. $\mathit{bmax}(i)$) the
minimum (resp. maximum) number of $b$'s in a substring containing exactly $i$ $a$'s. Then

$$(x, y) \text{ occurs in } s \text{ if and only if } \mathit{bmin}(x) \le y \le \mathit{bmax}(x). \tag{2}$$

The table of functions $F$ and $f$ in (1) is the index used in most algorithms for Binary Jumbled
String Matching [2], while that of functions $\mathit{bmin}$ and $\mathit{bmax}$ in (2) was used in [3]. Even
though the latter is always smaller, both are linear size in $n$. Note that one table can be computed
from the other in linear time (e.g. $\mathit{bmin}(i) = \min\{m \mid F(m) = i\} - i$).

Example 1. Let $s = aabababbaaabbaabbb$. Then $(3, 3)$ occurs in $s$ while $(5, 1)$ does not. We have
$F(6) = 4$ and $f(6) = 2$, $\mathit{bmin}(3) = 0$ and $\mathit{bmax}(3) = 5$. We give the full table of values of the two
functions $\mathit{bmin}$ and $\mathit{bmax}$ in Table 1.
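The values in Example 1 can be checked with a brute-force sketch that enumerates all substrings. It is meant only as a reference for small inputs (quadratic time) and is not the construction proposed in this paper; the function name `bmin_bmax` is an assumption made here.

```python
def bmin_bmax(s):
    """Brute-force tables: bmin[i] and bmax[i] are the minimum and maximum
    number of b's over all substrings of s containing exactly i a's."""
    total_a = s.count("a")
    bmin = [len(s)] * (total_a + 1)
    bmax = [0] * (total_a + 1)
    bmin[0] = 0  # the empty substring has no a's and no b's
    for start in range(len(s)):
        a = b = 0
        # extend the substring s[start:end] one character at a time
        for c in s[start:]:
            if c == "a":
                a += 1
            else:
                b += 1
            bmin[a] = min(bmin[a], b)
            bmax[a] = max(bmax[a], b)
    return bmin, bmax
```

Calling `bmin_bmax("aabababbaaabbaabbb")` reproduces the two rows of Table 1.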

3. The Corner Index 

In Fig. 1 we plot both functions $\mathit{bmin}$ and $\mathit{bmax}$ for our example string. The x-axis denotes
the number of $a$'s and the y-axis the number of $b$'s. It follows from (2) that the integer points
within the shaded area correspond to the Parikh set of $s$. The crucial observation is: since both
functions $\mathit{bmin}$ and $\mathit{bmax}$ are monotonically increasing step functions, it is sufficient to store those
points where they increase. These points are specially marked in Fig. 1.

Figure 1: The bmin and bmax functions for the string s = aabababbaaabbaabbb. Representing binary strings as walks
on the integer grid, s is indicated by a dashed line, while the functions bmin and bmax correspond to the prefix
normal forms of s; see Sec. 4 for more details.

Example 2. In our example, these points are, for bmin: {(3, 0), (5, 2), (7, 4), (9, 6)}, and for bmax: 
{(0,3), (2, 5), (5, 7), (6, 8), (7, 9)}. 

Definition 1. We define the Corner Index for the Parikh set of a given binary string $s$ as two
ordered sets $L_{\min}$ and $L_{\max}$, where

$$L_{\min} = \{(i, \mathit{bmin}(i)) \mid i = |s|_a \text{ or } \mathit{bmin}(i) < \mathit{bmin}(i + 1)\}, \tag{3}$$
$$L_{\max} = \{(i, \mathit{bmax}(i)) \mid i = 0 \text{ or } \mathit{bmax}(i) > \mathit{bmax}(i - 1)\}. \tag{4}$$

The order is according to both components, since for any $(x, y), (x', y') \in L_{\min}$ (or $\in L_{\max}$), we
have that $x < x'$ if and only if $y < y'$. Now for any $x$, we can recover $\mathit{bmin}(x)$ from $L_{\min}$ (resp.
$\mathit{bmax}(x)$ from $L_{\max}$) by noting that

$$\mathit{bmin}(x) = \mathit{bmin}(x_R), \quad x_R = \min\{x' \mid x' \ge x, \exists y' : (x', y') \in L_{\min}\}, \tag{5}$$

$$\mathit{bmax}(x) = \mathit{bmax}(x_L), \quad x_L = \max\{x' \mid x' \le x, \exists y' : (x', y') \in L_{\max}\}. \tag{6}$$
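A query against the Corner Index then amounts to two binary searches, one for $x_R$ in $L_{\min}$ and one for $x_L$ in $L_{\max}$. The sketch below assumes both lists are stored as Python lists of (x, y) tuples sorted by x; the helper names `corners` and `occurs_corner` are hypothetical, and `corners` simply applies (3) and (4) to already-known tables rather than using the construction of Sec. 3.1.

```python
from bisect import bisect_left, bisect_right


def corners(bmin, bmax):
    """Extract L_min and L_max from full bmin/bmax tables, following (3) and (4)."""
    last = len(bmin) - 1  # equals |s|_a
    l_min = [(i, bmin[i]) for i in range(last + 1)
             if i == last or bmin[i] < bmin[i + 1]]
    l_max = [(i, bmax[i]) for i in range(last + 1)
             if i == 0 or bmax[i] > bmax[i - 1]]
    return l_min, l_max


def occurs_corner(l_min, l_max, x, y):
    """Answer a query (x, y) with two binary searches, following (5) and (6)."""
    if x < 0 or y < 0 or x > l_min[-1][0]:
        return False
    # x_R: first corner of L_min whose first component is >= x
    lo = l_min[bisect_left(l_min, (x, -1))][1]
    # x_L: last corner of L_max whose first component is <= x
    hi = l_max[bisect_right(l_max, (x, float("inf"))) - 1][1]
    return lo <= y <= hi
```

With the tables of Table 1 as input, `corners` returns four entries for $L_{\min}$ and five for $L_{\max}$, and `occurs_corner` accepts (3, 3) and rejects (5, 1).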

3.1. Construction 

To construct the Corner Index, we will use the run-length encoding of $s$, $\mathit{rle}(s) =
(u_1, v_1, u_2, v_2, \ldots, u_r, v_r)$. We refer to maximal substrings consisting only of $a$'s (resp. $b$'s) as $a$-runs
(resp. $b$-runs), and denote by $p = |\mathit{rle}(s)|$, thus $2r - 2 \le p \le 2r$. It follows directly from the
definitions that

$$(x, y) \in \Pi(s) \implies \forall x' \le x : \mathit{bmin}(x') \le y \text{ and } \forall x' \ge x : \mathit{bmax}(x') \ge y. \tag{7}$$
Algorithm Construct $L_{\min}$

input: $\mathit{rle}(s) = (u_1, v_1, u_2, v_2, \ldots, u_r, v_r)$
for $k$ from 1 to $r$
    for $i$ from 1 to $r - k + 1$
        $(x, y) \leftarrow (u_i + \cdots + u_{i+k-1},\; v_i + \cdots + v_{i+k-2})$
        if not $L_{\min} \rhd (x, y)$
            then insert $(x, y)$ into $L_{\min}$
                 for each $(x', y')$ in $L_{\min}$ s.t. $(x, y) \rhd (x', y')$
                     delete $(x', y')$ from $L_{\min}$

Figure 2: The algorithm computing $L_{\min}$.

Lemma 2. Let $s$ be a binary string and $(x, y) \in \Pi(s)$. Then there exists a substring $t$ of $s$ which
begins and ends with a full $a$-run such that $p(t) = (x_1, y_1)$ and $x_1 \ge x$, $y_1 \le y$. Similarly, there is a
substring $t'$ of $s$ which begins and ends with a full $b$-run such that $p(t') = (x_2, y_2)$ and $x_2 \le x$, $y_2 \ge y$.

Proof. Let $w = s_i \cdots s_j$ be a substring of $s$ such that $p(w) = (x, y)$. If $s_i = a$, then extend $w$ to the
left to the beginning of the $a$-run containing $s_i$; if $s_i = b$, then shrink $w$ from the left to exclude all
$b$'s of the $b$-run containing $s_i$; likewise for $s_j$. The substring $t$ so obtained fulfils the requirements.
A substring $t'$ can be found analogously by extending $b$-runs and shrinking $a$-runs. □

Lemma 2, together with (7), implies that it suffices to compute substrings beginning and ending
with full $a$-runs (for $L_{\min}$) and beginning and ending with full $b$-runs (for $L_{\max}$). The algorithm
generates the Parikh vectors of these substrings one by one, inspects them and incrementally
constructs $L_{\min}$ and $L_{\max}$. For brevity of exposition, we only give the algorithm for constructing
$L_{\min}$; $L_{\max}$ can be computed simultaneously. We need the following definition:

Definition 3. Let $(x, y), (x', y') \in \Pi(s)$. We say that $(x, y)$ dominates $(x', y')$, denoted $(x, y) \rhd
(x', y')$, if $(x, y) \ne (x', y')$, $x \ge x'$ and $y \le y'$. For $X \subseteq \Pi(s)$, $(x, y) \in \Pi(s)$, define $X \rhd (x, y)$ iff
there exists $(x', y') \in X$ s.t. $(x', y') \rhd (x, y)$.

Since $\rhd$ is irreflexive and transitive, it is a (strict) partial order. Note that $L_{\min}$ is the set of
maximal elements in the poset $(\Pi(s), \rhd)$. (Another relation $\blacktriangleright$ can be defined analogously s.t. the
set of maximal elements equals $L_{\max}$.)

We present the algorithm computing the index in Fig. 2. Recall that $u_i$ (resp. $v_i$) is the
length of the $i$-th $a$-run (resp. $b$-run) of $s$. We compute, for every interval size $k \ge 1$, and every
$1 \le i \le r - k + 1$, the Parikh vector $(x, y)$ of the substring starting with the $i$-th $a$-run and spanning
$k$ $a$-runs, and query whether $(x, y)$ is dominated by some element in $L_{\min}$. Note that this is the
case if and only if $(x, y)$ is dominated by the unique pair $(x', y')$ where $x' = x_R$ from (5). If no
element of $L_{\min}$ dominates $(x, y)$, then it is added to $L_{\min}$, and all elements of $L_{\min}$ which $(x, y)$
dominates are removed from the list. These can be found consecutively in decreasing order from
the position where $(x, y)$ was inserted. The algorithm is illustrated in Fig. 3 on our example string.
The figure gives the run-length encoding of $s$, with $a$-runs in the first row and $b$-runs in the second,
all pairs which need to be inspected, and the final list $L_{\min}$.
Elements which are inserted into $L_{\min}$ and later removed are marked as such.
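The procedure of Fig. 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not a tuned implementation: the function name `construct_l_min` is hypothetical, the input is given as two separate lists of $a$-run and $b$-run lengths, and $L_{\min}$ is kept in a plain sorted list, so an insertion costs time linear in the list length. A balanced search tree would be needed to match the logarithmic bound per operation.

```python
from bisect import bisect_left


def construct_l_min(u, v):
    """Build L_min from the a-run lengths u and b-run lengths v of s
    (both of length r). Returns a list of (x, y) pairs sorted by x."""
    r = len(u)
    # Prefix sums let each Parikh vector be generated in constant time.
    pu = [0] * (r + 1)
    pv = [0] * (r + 1)
    for i in range(r):
        pu[i + 1] = pu[i] + u[i]
        pv[i + 1] = pv[i] + v[i]
    l_min = []
    for k in range(1, r + 1):          # number of a-runs spanned
        for i in range(r - k + 1):     # 0-based index of the first a-run
            x = pu[i + k] - pu[i]
            y = pv[i + k - 1] - pv[i]  # the k-1 b-runs between those a-runs
            pos = bisect_left(l_min, (x, -1))  # first entry with x' >= x
            if pos < len(l_min) and l_min[pos][1] <= y:
                continue  # dominated by (or equal to) an existing entry
            # Entries dominated by (x, y): an entry with the same x (and a
            # larger y), plus the run of entries to the left with y' >= y.
            end = pos + 1 if pos < len(l_min) and l_min[pos][0] == x else pos
            start = pos
            while start > 0 and l_min[start - 1][1] >= y:
                start -= 1
            l_min[start:end] = [(x, y)]
    return l_min
```

On the example string, whose $a$-runs have lengths 2, 1, 1, 3, 2 and whose $b$-runs have lengths 1, 1, 2, 2, 3, the call `construct_l_min([2, 1, 1, 3, 2], [1, 1, 2, 2, 3])` inserts (2,0), (4,2) and (6,4) along the way and removes them again.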
rle(s):   a-runs   2  1  1  3  2
          b-runs   1  1  2  2  3

Pairs inspected:
k = 1:   (2,0)  (1,0)  (1,0)  (3,0)  (2,0)
k = 2:   (3,1)  (2,1)  (4,2)  (5,2)
k = 3:   (4,2)  (5,3)  (6,4)
k = 4:   (7,4)  (7,5)
k = 5:   (9,6)

L_min (entries inserted and later removed are in brackets):
[(2,0)], (3,0), [(4,2)], (5,2), [(6,4)], (7,4), (9,6)

Figure 3: Computation of L_min for the example s = aabababbaaabbaabbb.



3.2. Analysis 

The number of entries of each list is upper bounded by $\min\{|s|_a, |s|_b, \binom{r+1}{2}\}$, thus the total size
of the Corner Index is $O(\min(n, p^2))$. The query time is $O(\log |L|) = O(\log p)$.

The working space of the construction algorithm is the maximum size $L_{\min}$ reaches during the
algorithm, which is at most $\binom{r+1}{2} = O(p^2)$. For the construction time, note that $O(p^2)$ pairs have
to be inspected. For each, we have to decide whether it is dominated by an element in $L_{\min}$; this
query amounts to finding $x_R$ from (5) in $L_{\min}$, in $O(\log p)$ time. Insertion of an element can cause
more than one deletion in the list; however, since each element is deleted at most once, we have
amortized time $O(\log p)$ per element, and thus altogether $O(p^2 \log p)$ time for the construction
algorithm.
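The count of inspected pairs can be written out explicitly. The following is a worked restatement of the bound, using only the loop ranges of the algorithm and the inequality $2r - 2 \le p$ from Sec. 3.1, which gives $r \le p/2 + 1$:

```latex
\[
  \sum_{k=1}^{r} (r - k + 1) \;=\; \binom{r+1}{2} \;=\; \frac{r(r+1)}{2}
  \;\le\; \frac{1}{2}\Bigl(\frac{p}{2} + 1\Bigr)\Bigl(\frac{p}{2} + 2\Bigr) \;=\; O(p^2).
\]
```

Multiplying by the amortized $O(\log p)$ cost per pair gives the $O(p^2 \log p)$ construction time.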

Note that $L_{\min}$ can be constructed by inspecting the $\binom{r+1}{2}$ pairs in an arbitrary order, although
our bound on the construction time assumes that the pairs are generated in constant time. We
summarize:

Theorem 4. Queries for the Binary Jumbled String Matching problem can be answered in $O(\log p)$
time, using an index of size $O(\min(p^2, n))$, where $n$ is the length of the text and $p$ the length of its
run-length encoding. The index can be constructed in $O(n + p^2 \log p)$ time from the string $s$.



4. Prefix Normal Forms 

We recall the definitions of rank and select (cf. [9]). Given a binary string $s$, we denote, for
$c \in \{a, b\}$, by $\mathit{rank}_c(s, i) = |s_1 \cdots s_i|_c$ the number of $c$'s in the prefix of length $i$ of $s$, and by
$\mathit{select}_c(s, i)$ the position of the $i$-th $c$ in $s$, i.e., $\mathit{select}_c(s, i) = \min\{k : |s_1 \cdots s_k|_c = i\}$.

It is possible [6] to associate to any binary string $s$ a unique string $s'$ such that for all $0 \le i \le |s|$,
$F_s(i) = F_{s'}(i) = \mathit{rank}_a(s', i)$, i.e., for any length $i$, the number of $a$'s in the prefix of $s'$ of length
$i$ equals the maximum number of $a$'s in any substring of $s$ of length $i$. The string $s'$ is called
the prefix normal form of $s$ with respect to $a$, denoted $\mathrm{PNF}_a(s)$; the prefix normal form w.r.t. $b$,
$\mathrm{PNF}_b(s)$, is defined analogously.

Example 3. For our string $s = aabababbaaabbaabbb$, the prefix normal forms are

$$\mathrm{PNF}_a(s) = aaabbaabbaabbaabbb, \text{ and} \tag{8}$$
$$\mathrm{PNF}_b(s) = bbbaabbaaabbababaa. \tag{9}$$
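Example 3 can be reproduced directly from the definition: the $m$-th character of the prefix normal form with respect to $c$ is $c$ exactly when the maximum number of $c$'s over all length-$m$ substrings increases from $m - 1$ to $m$. The sketch below follows this definition by brute force in quadratic time; it is an illustration only, and the function name `prefix_normal_form` is an assumption.

```python
def prefix_normal_form(s, c):
    """PNF_c(s): the prefix of length m contains as many c's as the
    maximum number of c's in any length-m substring of s."""
    other = "b" if c == "a" else "a"
    n = len(s)
    # prefix[i] = number of c's in s[:i]
    prefix = [0]
    for ch in s:
        prefix.append(prefix[-1] + (1 if ch == c else 0))
    out = []
    prev = 0
    for m in range(1, n + 1):
        best = max(prefix[i + m] - prefix[i] for i in range(n - m + 1))
        # the maximum grows by at most one per step; emit c when it grows
        out.append(c if best > prev else other)
        prev = best
    return "".join(out)
```

For the example string, `prefix_normal_form(s, "a")` and `prefix_normal_form(s, "b")` give the two strings in (8) and (9).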

By definition, $\mathit{bmin}(i)$ is the minimum number of $b$'s in a prefix of $\mathrm{PNF}_a(s)$ containing exactly
$i$ $a$'s, and $\mathit{bmax}(i)$ the maximum number of $b$'s in a prefix of $\mathrm{PNF}_b(s)$ containing exactly $i$ $a$'s. So
we have:

$$F(i) = \mathit{rank}_a(\mathrm{PNF}_a(s), i) \quad \text{for } 0 \le i \le |s|, \tag{10}$$

$$\mathit{bmin}(i) = \mathit{select}_a(\mathrm{PNF}_a(s), i) - i \quad \text{for } 0 \le i \le |s|_a, \text{ and} \tag{11}$$

$$\mathit{bmax}(i) = \begin{cases} \mathit{select}_a(\mathrm{PNF}_b(s), i + 1) - (i + 1) & \text{for } 0 \le i < |s|_a, \\ |s|_b & \text{for } i = |s|_a. \end{cases} \tag{12}$$

In fact, if we represent binary strings by drawing a horizontal unit line segment for each $a$ and
a vertical one for each $b$, then $\mathrm{PNF}_a(s)$ is represented by function $\mathit{bmin}$, and $\mathrm{PNF}_b(s)$ by function
$\mathit{bmax}$, see Fig. 1.

Moreover, the run-length encoding of $\mathrm{PNF}_a(s)$ contains the same information as the list $L_{\min}$
output by our algorithm: indeed, let $\mathit{rle}(\mathrm{PNF}_a(s)) = (u'_1, v'_1, u'_2, v'_2, \ldots, u'_{r'}, v'_{r'})$. Then, setting
$p_m = \sum_{i=1}^{m} u'_i$ and $q_m = \sum_{i=1}^{m} v'_i$ (so that $q_0 = 0$), one has

$$L_{\min} = \{(p_m, q_{m-1}) \mid m = 1, \ldots, r'\}. \tag{13}$$

In particular, $|L_{\min}| = \lceil |\mathit{rle}(\mathrm{PNF}_a(s))| / 2 \rceil$, and this gives a bound on the size of the output in
terms of the prefix normal form.
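Relation (13) can be illustrated by reading the corners off the runs of $\mathrm{PNF}_a(s)$: each $a$-run ends at a corner whose coordinates are the numbers of $a$'s and $b$'s seen so far. The following sketch assumes that $\mathrm{PNF}_a(s)$ is already available as a Python string and that $s$ contains at least one $a$; the function name `l_min_from_pnf` is hypothetical.

```python
from itertools import groupby


def l_min_from_pnf(pnf_a):
    """Corner list L_min read off the runs of PNF_a(s), following (13)."""
    corners = []
    p = q = 0  # running totals: a's up to run m, b's up to run m-1
    for ch, group in groupby(pnf_a):
        length = len(list(group))
        if ch == "a":
            p += length
            corners.append((p, q))  # the pair (p_m, q_{m-1})
        else:
            q += length
    return corners
```

Applied to the string in (8), this returns the four corners of $L_{\min}$ listed in Example 2.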



5. Open problems 

We conclude with some open problems. First, we are interested in tighter bounds on the size
of the Corner Index in terms of $p$, the length of the run-length encoding of the input string: our
preliminary experiments on random strings suggest that the size of the index may often be close
to $p$. Second, how much working space is required by our algorithm? In our experiments it was
rare for the maximal size of the index during construction to exceed the final index size. Hopefully
this working space can be bounded by making use of the structure of the posets $(\Pi(s), \rhd)$ and
$(\Pi(s), \blacktriangleright)$ introduced in Sec. 3. Third, does the number of maximal pairs in these posets (which
is the total length of the run-length encodings of the two PNFs) constitute a lower bound on the
size of any index for the Binary Jumbled String Matching problem? Better understanding these
posets could also lead to an improvement of our algorithm's running time: if we could characterize
maximal pairs, it may no longer be necessary to inspect all possible pairs.